Skip to content

fix(shield): prevent stuck allowlist-waiter Job after hook failure#2619

Merged
francesco-furlan merged 4 commits into
mainfrom
shield/fix-allowlist-waiter-hook-cleanup
May 25, 2026
Merged

fix(shield): prevent stuck allowlist-waiter Job after hook failure#2619
francesco-furlan merged 4 commits into
mainfrom
shield/fix-allowlist-waiter-hook-cleanup

Conversation

@francesco-furlan

@francesco-furlan francesco-furlan commented May 12, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Fixes a stuck-cluster state observed on a customer running shield-1.37.1 on GKE Autopilot via Flux: when the pre-install shield-host-allowlist-waiter Job times out or fails, Helm's hook executor (per helm/helm pkg/action/hooks.go) applies the HookSucceeded delete-policy to all previously-succeeded hooks in the batch, sweeping the waiter's SA/CR/CRB while the failed Job survives. The Job then loops FailedCreate forever against a missing ServiceAccount.

Note on history
This PR originally also decoupled the waiter SA/CR/CRB from host.rbac.create via a dedicated gke_autopilot.allowlist_waiter.create_rbac flag. That work has been superseded by #2633, which landed a richer per-resource RBAC schema (gke_autopilot.allowlist_waiter.rbac.{create,service_account.create,cluster_role.create}). The original commit was dropped during the rebase onto main; this PR no longer introduces RBAC plumbing of its own.

Three logically separable changes, one commit each (kept independently bisectable), on top of main's RBAC sub-toggle work:

  1. The core fix:

    • Drop hook-succeeded from the SA/CR/CRB helm.sh/hook-delete-policy so they survive Helm's HookSucceeded sweep on Job failure.
    • Add hook-failed to the Job's helm.sh/hook-delete-policy so a failed Job is reaped instead of looping FailedCreate.
    • AllowlistSynchronizer hook policy unchanged (already before-hook-creation only — it must persist past the waiter).
  2. Emit kubectl describe and full YAML of the AllowlistSynchronizer on wait failure, so the next on-caller has actionable diagnostics instead of a bare exit (the customer's first failure was unrecoverable because pod logs were already evicted by the time investigation started). No new RBAC needed.

  3. Add gke_autopilot.allowlist_waiter.active_deadline_seconds (default 300) wired into Job.spec.activeDeadlineSeconds. Belt-and-suspenders Job-level guard against pod hangs before the inner kubectl wait timeout fires (image-pull stalls, scheduler delays, admission webhook hangs).

Checklist

  • Title of the PR starts with type and scope
  • Chart Version bumped for the respective charts (1.39.0 → 1.40.0)
  • Variables are documented in the README.md
  • Check GithubAction checks (like lint) to avoid merge-check stoppers
  • All test files are added in the tests folder of their respective chart and have a "_test" suffix

@francesco-furlan francesco-furlan requested a review from a team as a code owner May 12, 2026 14:46
…k failure

Helm 3's hook executor applies the HookSucceeded delete-policy to both
the failed hook and all previously-succeeded hooks in the batch (see
helm/helm pkg/action/hooks.go execHook). When the allowlist-waiter Job
(weight 5) fails, Helm sweeps the SA/CR/CRB (weight -5) because they
carried hook-succeeded — leaving the failed Job referencing a missing
SA, which produces a FailedCreate retry storm bounded only by manual
intervention.

Two surgical changes:

  * Drop hook-succeeded from the SA/CR/CRB delete-policy. They now
    persist across failures (and across successes — a lingering tiny
    SA/CR/CRB that the next install's before-hook-creation reclaims).
  * Add hook-failed to the Job's delete-policy so a failed Job is
    reaped instead of looping forever against a deleted SA.

The AllowlistSynchronizer is unchanged (already before-hook-creation
only — it must persist after the waiter exits since host-shield pods
rely on Warden seeing it).
When the waiter's kubectl wait times out, the Pod previously exited with
just the bare wait error, leaving no diagnostics by the time the next
on-caller arrived (especially after kubelet log eviction).

On non-zero exit, the script now emits a kubectl describe and full YAML
of the AllowlistSynchronizer to stderr before exiting with the original
error code. The existing RBAC (get/list/watch on allowlistsynchronizers)
is sufficient; events lookup is intentionally skipped to avoid widening
the ClusterRole.
…econds

The Job's only existing timeout lives inside the container script
(kubectl wait --timeout). If the Pod hangs before that line executes
(image-pull stall, scheduler delay, admission webhook latency), the
Job can outrun Helm's pre-install hook timeout.

Add a Job-level activeDeadlineSeconds guard via the new
gke_autopilot.allowlist_waiter.active_deadline_seconds value, defaulting
to 300 (180s of headroom over the default 120s wait timeout).
@francesco-furlan francesco-furlan force-pushed the shield/fix-allowlist-waiter-hook-cleanup branch from d53d85f to daf64fb Compare May 25, 2026 08:38
@francesco-furlan francesco-furlan merged commit a3329c9 into main May 25, 2026
8 checks passed
@francesco-furlan francesco-furlan deleted the shield/fix-allowlist-waiter-hook-cleanup branch May 25, 2026 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants